Abstract

We consider the problem of unveiling the implicit network structure of user interactions
in a social network, based only on high-frequency timestamps. Our inference is based on
the minimization of the least-squares loss associated with a multivariate Hawkes model,
penalized by $\ell_1$ and trace norms. We provide a first theoretical analysis of the generalization
error for this problem, which includes sparsity and low-rank inducing priors. This result
involves a new data-driven concentration inequality for matrix martingales in continuous
time with observable variance, which is a result of independent interest. A consequence of
our analysis is the construction of sharply tuned $\ell_1$ and trace-norm penalizations, which leads
to a data-driven scaling of the variability of information available for each user. Numerical
experiments illustrate the strong improvements achieved by the use of such data-driven
penalizations.


1 Introduction 

Understanding the dynamics of social interactions is a challenging problem of rapidly growing
interest [11, 20, 9, 21] because of the large number of applications in web-advertisement and
e-commerce, where large-scale logs of event history are available. A common supervised approach
consists in predicting labels based on declared interactions (friendship, like, follower, etc.).
However, such supervision is not always available, and it does not always accurately describe the
level of interaction between users. Labels are often only binary, while a quantification of the
interaction is more interesting; declared interactions are often outdated; and more generally a
supervised approach is not enough to infer the latent communities of users, as the temporal patterns
of users' actions are much more informative.

A recent set of papers [32, 14, 10] considers an approach for recovering latent social groups
directly from the real actions or events of users, also called nodes in the following, that
uses only the timestamp patterns of the considered events. These models assume that the
data consist of a sequence of independent cascades, containing timestamps for each node.
In these works, techniques coming from survival analysis are used to derive a tractable convex
likelihood, which allows one to infer the latent community structure. However, this model requires
that the data is already segmented into sets of independent cascades, which is not always realistic.
Moreover, it does not allow for recurrent events, namely a node can be infected only once, and
it cannot incorporate exogenous factors, namely influence from the world outside the network.
Another approach is based on self-exciting point processes, such as the Hawkes process [16].
Previously used in geophysics [28], high-frequency finance [1] and crime activity [26], this model
has also recently been used for modeling users' activity in social networks, see for
instance [9, 6, 36, 35]. The main point is that the structure of the Hawkes model allows one to
capture the direct influence of a user's action on the others, based on the recurrence and the
patterns of action timestamps. It encompasses in the same likelihood the decay of the influence
over time, the levels of interaction between nodes, which can be seen as a weighted asymmetrical
adjacency matrix, and a baseline intensity, which measures the level of exogeneity of a user, namely
the spontaneous occurrence of an action, with no influence from other nodes of the network.

In this paper, we consider a multivariate Hawkes process (MHP), and we combine convex
proxies for sparsity and low rank of the adjacency matrix and the baseline intensities, that are
now in common use in low-rank modeling for collaborative filtering problems [7, 8]. Note that this
approach is also considered in [36]. We provide a first theoretical analysis of the generalization
error for this problem, see [15] for an analysis including only entrywise $\ell_1$ penalization. Namely,
we prove a sharp oracle inequality for our procedure, which includes sparsity and low-rank
inducing priors, see Theorem 1 in Section 4. This result involves a new data-driven concentration
inequality for matrix martingales in continuous time, see Theorem 3 in Section 5, which is a
result of independent interest and extends previous non-commutative versions of concentration
inequalities for martingales in discrete time, see [33]. A consequence of our analysis is the
construction of sharply tuned $\ell_1$ and trace-norm penalizations, which leads to a data-driven scaling
of the variability of information available for each node. We give empirical evidence of the
improvements brought by our data-driven penalizations, by conducting numerical experiments
on simulated data in Section 6. Since the objectives involved are convex with a smooth component,
our algorithms build upon standard accelerated batch proximal gradient algorithms.

2 The multivariate Hawkes model

Consider a finite network with $d$ nodes (each node corresponding to a user in a social
network, for instance). For each node $j \in \{1, \ldots, d\}$, we observe the timestamps
$\{t_{j,1}, t_{j,2}, \ldots\}$ of actions of node $j$ on the network (a message, a click, etc.).
To each node $j$ is associated a counting process $N_j(t) = \sum_{i \ge 1} \mathbf{1}_{t_{j,i} \le t}$,
and we consider the $d$-dimensional counting process
$N_t = [N_1(t) \cdots N_d(t)]^\top \in \mathbb{N}^d$, for $t \ge 0$. We observe this process for
$t \in [0,T]$. Each $N_j$ has an intensity $\lambda_j$, meaning that

$$\mathbb{P}(N_j \text{ has a jump in } [t, t+dt] \mid \mathcal{F}_t) = \lambda_j(t)\,dt, \quad j = 1, \ldots, d,$$

where $\mathcal{F}_t$ is the $\sigma$-field generated by $N$ up to time $t$. The multivariate Hawkes model
assumes that each $N_j$ has an intensity $\lambda_{j,\theta}$ given by

$$\lambda_{j,\theta}(t) = \mu_j + \sum_{j'=1}^d a_{j,j'} \int_{(0,t)} h_{j,j'}(t-s)\,dN_{j'}(s),$$

where the integral is a Stieltjes integral, namely

$$\int_{(0,t)} h_{j,j'}(t-s)\,dN_{j'}(s) = \sum_{i \,:\, t_{j',i} \in (0,t)} h_{j,j'}(t - t_{j',i}),$$

where $\theta = (\mu, A)$ with $\mu = [\mu_1, \ldots, \mu_d]^\top$ and $A = [a_{j,j'}]_{1 \le j,j' \le d}$,
where $\mu_j \ge 0$ is the baseline intensity of $j$, where $a_{j,j'} \ge 0$ is a coefficient that
quantifies the influence of $j'$ on $j$, and where $h_{j,j'} : \mathbb{R}_+ \to \mathbb{R}_+$ are decay
functions that account for the decay of influence between pairs of nodes in the network.
A typical choice for $h_{j,j'}$ is the exponential kernel, parametrized by a decay coefficient
$\alpha_{j,j'} > 0$. We consider these functions fixed and known in this
paper. The parameter of interest is the self-excitement matrix $A$, which can be viewed as a
weighted asymmetrical adjacency matrix of connectivity between nodes.
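
As an illustration of the intensity formula above, the following minimal Python sketch evaluates $\lambda_{j,\theta}(t)$ directly from raw timestamps. It assumes, for the sake of the example only, the exponential kernel $h_{j,j'}(u) = e^{-\alpha u}$ with a single decay coefficient shared by all pairs of nodes; the function name `hawkes_intensity` and the toy values are hypothetical.

```python
import math


def hawkes_intensity(t, j, mu, A, alpha, timestamps):
    """Evaluate lambda_{j,theta}(t) = mu_j + sum_{j'} a_{j,j'} * sum_{t_i < t} h(t - t_i).

    Assumption of this sketch: h(u) = exp(-alpha * u) for every pair (j, j').
    `timestamps[jp]` is the list of event times of node jp.
    """
    total = mu[j]
    for jp, events in enumerate(timestamps):
        # Stieltjes integral over (0, t): only events strictly before t contribute.
        excitation = sum(math.exp(-alpha * (t - s)) for s in events if s < t)
        total += A[j][jp] * excitation
    return total


# Toy example with d = 2 nodes: node 1 excites node 0, nothing else.
mu = [0.1, 0.0]
A = [[0.0, 0.5], [0.0, 0.0]]
timestamps = [[], [1.0]]
print(hawkes_intensity(2.0, 0, mu, A, 1.0, timestamps))
```

Before the first event of node 1 the intensity of node 0 equals its baseline; after it, the intensity jumps by $a_{0,1}$ and then decays back towards the baseline.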
Figure 1: Toy example with d = 10 nodes. Based on the timestamps of the nodes' actions,
represented by vertical bars (left figure), we aim at recovering the matrix A of implicit influence
between nodes (right figure).

The Hawkes model is particularly relevant for modeling the “microscopic” activity
of social networks, and has recently received a lot of attention in the literature for this kind
of application (see [9, 6, 36, 35, 23, 12, 6, 17], among others), with a particular emphasis on [15],
which gives first theoretical results for the Lasso used with Hawkes processes, with an application
to neurobiology. The main point is that this simple autoregressive structure of the intensity allows
one to capture the direct influence of a user on the others, based on the recurrence and the patterns
of their actions, by separating the intensity into a baseline and a self-exciting component, hence
making it possible to filter out exogeneity in the estimation of users' influences on each other.


3 The procedure 


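We want to produce an estimation procedure of $\theta = (\mu, A)$ based on data from $\{N_t : t \in [0,T]\}$.
The hidden structure underlying the observed actions of nodes will be contained in $A$. A way
of achieving this is to minimize the least-squares functional given by

$$R_T(\theta) = \|\lambda_\theta\|_T^2 - \frac{2}{T} \sum_{j=1}^d \int_{[0,T]} \lambda_{j,\theta}(t)\,dN_j(t) \qquad (1)$$

with respect to $\theta$, where $\|\lambda_\theta\|_T^2 = \frac{1}{T} \sum_{j=1}^d \int_{[0,T]} \lambda_{j,\theta}(t)^2\,dt$
is the squared norm associated with the inner product

$$\langle \lambda_\theta, \lambda_{\theta'} \rangle_T = \frac{1}{T} \sum_{j=1}^d \int_{[0,T]} \lambda_{j,\theta}(t)\,\lambda_{j,\theta'}(t)\,dt. \qquad (2)$$

This least-squares functional is very natural, and comes from the empirical risk minimization
principle [34, 25, 18, 3]: assuming that $N$ has an unknown ground truth intensity $\lambda$ (not
necessarily following the Hawkes model), the Doob-Meyer decomposition easily gives

$$\mathbb{E}[R_T(\theta)] = \mathbb{E}\|\lambda_\theta\|_T^2 - 2\,\mathbb{E}\langle \lambda_\theta, \lambda \rangle_T = \mathbb{E}\|\lambda_\theta - \lambda\|_T^2 - \mathbb{E}\|\lambda\|_T^2,$$

so that we expect a minimizer $\hat\theta$ of $R_T(\theta)$ to lead to a good estimation $\lambda_{\hat\theta}$ of $\lambda$.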

In addition to this goodness-of-fit criterion, we need to use a penalization that reduces
the dimensionality of the model. In particular, we want to reduce the dimensionality of
$A$, based on the prior assumption that latent factors explain the connectivity of users in the
network. This leads to a low-rank assumption on $A$, which is commonly used in collaborative
filtering and matrix completion techniques [30]. Our prior assumptions on $\mu$ and $A$ are the
following.

Sparsity of $\mu$. Some nodes are basically inactive and react only if stimulated. Hence, we
assume that the baseline vector $\mu$ is sparse.

Sparsity of $A$. A user interacts only with a fraction of other nodes, meaning that for a fixed
node $j$, only a few $a_{j,j'}$ are non-zero. Hence, we assume that $A$ is a sparse matrix.

Low-rank of $A$. Node interactions have a community structure. The network contains cliques, leading
to a block-diagonal adjacency matrix that has the property of being both sparse and low-rank.

To induce these prior assumptions on the parameters, we use a penalization based on a
mixture of the $\ell_1$ and trace norms. These norms are respectively the tightest convex relaxations
for sparsity and low-rank, see for instance [7, 8]. They provide state-of-the-art results in
compressed sensing and collaborative filtering problems, among many other problems. These two
norms have been previously combined for the estimation of sparse and low-rank matrices, see
for instance [31], and [36] in the context of MHP. We therefore consider the following penalization
on the parameter $\theta = (\mu, A)$:


$$\mathrm{pen}(\theta) = \|\mu\|_{1,\hat w} + \|A\|_{1,\hat W} + \hat\tau \|A\|_*, \qquad (3)$$

where the terms are weighted $\ell_1$ and trace norm penalizations, given by

$$\|\mu\|_{1,\hat w} = \sum_{j=1}^d \hat w_j |\mu_j|, \quad \|A\|_{1,\hat W} = \sum_{1 \le j,k \le d} \hat W_{j,k} |a_{j,k}|, \quad \|A\|_* = \sum_{j=1}^d \sigma_j(A),$$

where $\sigma_1(A) \ge \cdots \ge \sigma_d(A)$ are the singular values of $A$. The weights $\hat w$, $\hat W$, and the
coefficient $\hat\tau$ are data-driven tuning parameters described below. The choice of these weights
comes from a sharp analysis of the noise terms, see Section 5 below, and they lead to a
data-driven scaling of the variability of information available for each node. The set of matrices $A$
obtained by minimizing an objective penalized by (3) contains matrices that can be written in
a block-diagonal or overlapping block-diagonal form, up to permutations of rows and columns.
We then consider

$$\hat\theta \in \operatorname{argmin}_{\theta} \big\{ R_T(\theta) + \mathrm{pen}(\theta) \big\}, \qquad (4)$$

which is a solution to the penalized least-squares problem.
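
The penalization (3) is straightforward to evaluate numerically. The sketch below uses NumPy; the function name `penalty` and its arguments are illustrative only, and the weights are taken as given inputs rather than computed from data.

```python
import numpy as np


def penalty(mu, A, w, W, tau):
    """Weighted l1 penalty on mu and A plus tau times the trace norm of A.

    mu, w: arrays of shape (d,); A, W: arrays of shape (d, d); tau: scalar.
    """
    l1_mu = np.sum(w * np.abs(mu))            # ||mu||_{1,w}
    l1_A = np.sum(W * np.abs(A))              # ||A||_{1,W}
    trace_norm = np.linalg.norm(A, ord="nuc")  # sum of singular values of A
    return l1_mu + l1_A + tau * trace_norm


# Toy values: a single active baseline and a rank-one adjacency matrix.
value = penalty(np.array([1.0, 0.0]), np.diag([2.0, 0.0]),
                np.ones(2), np.ones((2, 2)), 0.5)
print(value)
```

With unit weights the first two terms reduce to the usual unweighted $\ell_1$ norms, which corresponds to the non-weighted penalization used as a baseline in Section 6.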

Let us now define the data-driven weights $\hat w$, $\hat W$ and $\hat\tau$ used in (3). From now on, we fix
some confidence level $x > 0$, which controls the probability that the oracle inequality
from Theorem 1 holds. This can be safely chosen as $x = \log d$ for instance. The weights for
$\ell_1$-penalization of $\mu$ are given by

^^ / (x + logd + UT))Ar,([0.T|)/r ^ 3 ^ + logd + 4,(r) 

where $N_j([0,T]) = \int_0^T dN_j(t)$ and where $\ell_{x,j}(T)$ is a correction term of $2 \log\log$ type. This weighting of each
coordinate $j$ in the penalization of $\mu$ is natural: it is roughly proportional to the square-root
of $N_j([0,T])/T$, which is the average intensity of events on coordinate $j$. The term $\ell_{x,j}(T)$ is
a technical term that can be neglected in practice, see Section 6. The data-driven weights for
$\ell_1$-penalization of $A$ are given by


= 

where 




{x + 2logd + Lx,j,k{T))Bj^k{T) 
T 


= sup iJj-fc(s), Hj^kit)= hj^kit-s)dNk{s), 
se[o,t] J{o,t) 

1 

Vj^k{t) = - 

^ ./o 


and where 


Lx,j,k{t) = 2 log log 


hj k{s — u)dNk{u) 1 dNj{s 

iO,s) ’ ^ 

QtVj,k{t) + 


Ve . 


(7) 


( 8 ) 


Once again, this is natural: the variance term $V_{j,k}(t)$ in (7) is, roughly, an estimation of the
variance of the self-excitements between coordinates $j$ and $k$. The term $L_{x,j,k}(T)$ is a technical
term that can be neglected in practice.
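
To make the observable quantities entering these weights concrete, here is a small sketch that computes $V_{j,k}(T)$ and $B_{j,k}(T)$ from timestamps. It reads (7) as $H_{j,k}(t) = \int_{(0,t)} h_{j,k}(t-s)\,dN_k(s)$, $B_{j,k}(T) = \sup_{s \in [0,T]} H_{j,k}(s)$ and $V_{j,k}(T) = \frac{1}{T} \int_0^T H_{j,k}(s)^2\,dN_j(s)$, and it assumes the exponential kernel $h(u) = e^{-\alpha u}$. The function names are hypothetical.

```python
import math


def excitation(t, events_k, alpha):
    """H_{j,k}(t): sum of exp(-alpha * (t - s)) over events s of node k with s < t."""
    return sum(math.exp(-alpha * (t - s)) for s in events_k if s < t)


def variance_and_sup(events_j, events_k, alpha, T):
    """Return (V_{j,k}(T), B_{j,k}(T)) for the assumed kernel h(u) = exp(-alpha * u)."""
    # Optional variation: H_{j,k} squared, summed over the events of node j, divided by T.
    V = sum(excitation(s, events_k, alpha) ** 2 for s in events_j if s <= T) / T
    # H decays between events of k and jumps by h(0) = 1 right after each of them,
    # so its supremum is attained as a right limit at an event time of node k.
    B = max((excitation(s, events_k, alpha) + 1.0 for s in events_k if s < T),
            default=0.0)
    return V, B


V, B = variance_and_sup(events_j=[2.0], events_k=[1.0], alpha=1.0, T=4.0)
print(V, B)
```

Both quantities depend only on the observed timestamps, which is what makes the resulting weights usable in practice.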

The coefficient $\hat\tau$ comes from a new concentration inequality for matrix martingales in
continuous time, see Theorem 3 in Section 5 below. We consider indeed


r = 


'(x + logd + 4(T))(||Fi(r)||opV||y2(T)|| 


lop; 


2{x -b logd 4(T))(10.34 -b 2.65supig[o^T] \\H{t)\\ 2 ,oo) 


(9) 


-b 


lop 


stands for the operator norm, namely the largest singular value, where H(t) is the 


where 

matrix with entries Hj^k{t) given in (7), where Vi{t) is the diagonal matrix with entries 

= ll\\H{s)\\l^dNj{s), 

and where V 2 {t) is the matrix with entries 

{V2{t))j,k — / l|-f^('S)||2,oo ^2-,, II rr /„\||2 dNl{s), 

^ JO l|-"b»vSl|l2 

where || • 4 is the 4-uorm, ||f^|| 2 ,oo is the maximum 4 norm of the rows of X, and where Hi , 
is the /-th row of H. The extra technical term ix{t) is given by 

^2||Fi(t)||op -b2(4-bsup^g[o,t] \\H{s)\\l^/3)x 


( 10 ) 


( 11 ) 


£x{t) = 2 log log 


Ve 


-b 2 log log 


2||F2(t)||op + 2(4 -b sup^g[o,j] \\H{s)\\l^/3)x 


V e 


( 12 ) 


-b 2 log log ( sup ||iT(s)|| 2 ,oo 
■ se[o,t] 


Ve . 


These weights are actually quite natural: the terms $V_{j,k}(t)$, $\|V_1(t)\|_{op}$ and $\|V_2(t)\|_{op}$
correspond to estimations of the noise variance, which are the terms appearing in the empirical
Bernstein's inequalities given in Section 5 below. This will allow for a sharp tuning of the
penalizations. The terms $B_{j,k}(t)$ and $\sup_{t \in [0,T]} \|H(t)\|_{2,\infty}$ correspond to the $L^\infty$ terms from these
Bernstein's inequalities.
4 A sharp oracle inequality


Recall that the inner product $\langle \lambda_1, \lambda_2 \rangle_T$ is given by (2) and recall that $\|\cdot\|_T$ stands for the
corresponding norm. Theorem 1 is a sharp oracle inequality on the prediction error measured
by $\|\lambda_{\hat\theta} - \lambda\|_T^2$. For the proof of oracle inequalities with a fast rate, one needs a restricted
eigenvalue condition on the Gram matrix of the problem [5, 18]. One of the weakest assumptions
considered in the literature is the Restricted Eigenvalue (RE) condition. In our setting, a natural
RE assumption is given in Definition 1 below. We denote by $\|\cdot\|_F$ the Frobenius norm. If
$X = U \Sigma V^\top$ is the SVD of $X$, with the columns $u_j$ of $U$ and $v_k$ of $V$ being, respectively, the
orthonormal left and right singular vectors of $X$, the projection matrix onto the space spanned
by the columns (resp. rows) of $X$ is given by $P_U = U U^\top$ (resp. $P_V = V V^\top$). The operator
$\mathcal{P}_X : \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}$ given by $\mathcal{P}_X(Y) = P_U Y + Y P_V - P_U Y P_V$ is the projector onto the linear
space spanned by the matrices $u_k x^\top$ and $y v_j^\top$ for $1 \le j,k \le d$ and $x, y \in \mathbb{R}^d$. The projector
onto the orthogonal space is given by $\mathcal{P}_X^\perp(Y) = (I - P_U) Y (I - P_V)$. If $x$ is a vector then
$\mathrm{supp}(x)$ stands for the support of $x$ (indices of non-zero entries) and for another vector $x'$ the
notation $(x')_{\mathrm{supp}(x)}$ stands for the vector with the same coordinates as $x'$ where we put 0 at indices
outside of $\mathrm{supp}(x)$. We use the same notation $(X')_{\mathrm{supp}(X)}$ for matrices $X'$ and $X$. We also use
the notation $a \vee b = \max(a, b)$.

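Definition 1. Fix $\theta = (\mu, A)$ where $\mu \in \mathbb{R}^d$ and $A \in \mathbb{R}^{d \times d}$. We define the constant $\kappa(\theta)$ such
that, for any $\theta' = (\mu', A')$ satisfying

$$\|(\mu')_{\mathrm{supp}(\mu)^\perp}\|_{1,\hat w} \le 3 \|(\mu')_{\mathrm{supp}(\mu)}\|_{1,\hat w}$$

and

$$\|(A')_{\mathrm{supp}(A)^\perp}\|_{1,\hat W} + \hat\tau \|\mathcal{P}_A^\perp(A')\|_* \le 3 \|(A')_{\mathrm{supp}(A)}\|_{1,\hat W} + 3 \hat\tau \|\mathcal{P}_A(A')\|_*,$$

we have

$$\|(\mu')_{\mathrm{supp}(\mu)}\|_2 \vee \|(A')_{\mathrm{supp}(A)}\|_F \vee \|\mathcal{P}_A(A')\|_F \le \kappa(\theta) \|\lambda_{\theta'}\|_T. \qquad (13)$$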

The constant $1/\kappa(\theta)$ is a restricted eigenvalue depending on the “support” of $\theta$, which
is naturally associated with the problem considered here. Roughly, it requires that for any
parameter $\theta'$ that has a support close to the one of $\theta$ (measured by domination of the $\ell_1$ norms
outside the support of $\theta$ by the $\ell_1$ norm inside it), the $L^2$ norm of the intensity,
given by $\|\lambda_{\theta'}\|_T$, can be compared with the $\ell_2$ norm of $\theta'$ in the support of $\theta$.

Remark 1. Under some conditions on the possible set of values for the decay functions $h_{j,j'}$, one
can prove that a stronger condition than the one considered here holds with a large probability,
see Proposition 4 from [15]. This result is based on a careful analysis of the ergodicity properties
of MHP.

Theorem 1. Fix $x > 0$, and let $\hat\theta$ be given by (4), with tuning parameters given by (5), (6)
and (9). Then, the inequality

||Ag — A||y < inf III Ae — A||r + K(0)^^-||(u))supp(p)||2 + g ||(f^)supp(A)llF + rank(A)^ | (14) 

holds with a probability larger than $1 - 146 e^{-x}$.

Note that no assumption is required on the ground truth intensity $\lambda$ of the multivariate
counting process $N$ in Theorem 1. The proof of Theorem 1 is given in Section 8.2. Let us
observe that

$$\|(\hat w)_{\mathrm{supp}(\mu)}\|_2^2 \le \|\mu\|_0 \max_{j \in \mathrm{supp}(\mu)} \Big( c_1 \frac{(x + \log d + \ell_{x,j}(T))\, N_j([0,T])/T}{T} + c_2 \Big( \frac{x + \log d + \ell_{x,j}(T)}{T} \Big)^2 \Big),$$

where $\|\mu\|_0$ stands for the sparsity of $\mu$, that

$$\|(\hat W)_{\mathrm{supp}(A)}\|_F^2 \le \|A\|_0 \max_{(j,k) \in \mathrm{supp}(A)} \Big( c_1 \frac{(x + 2 \log d + L_{x,j,k}(T))\, V_{j,k}(T)}{T} + c_2 \Big( \frac{(x + 2 \log d + L_{x,j,k}(T))\, B_{j,k}(T)}{T} \Big)^2 \Big),$$

where $\|A\|_0$ stands for the sparsity of $A$, and finally that

$$\hat\tau^2 \le c_1 \frac{(x + \log d + \ell_x(T))\, \|V_1(T)\|_{op} \vee \|V_2(T)\|_{op}}{T} + c_2 \Big( \frac{(x + \log d + \ell_x(T)) (10.34 + 2.65 \sup_{t \in [0,T]} \|H(t)\|_{2,\infty})}{T} \Big)^2,$$

where $c_1, c_2 > 0$ are numerical constants. Hence, Theorem 1 proves that $\hat\theta$ achieves an optimal
tradeoff between approximation and complexity, where the complexity is, roughly, measured by

$$\frac{\|\mu\|_0 (x + \log d)}{T} \max_j N_j([0,T])/T + \frac{\|A\|_0 (x + 2 \log d)}{T} \max_{j,k} V_{j,k}(T) + \frac{\mathrm{rank}(A) (x + \log d)}{T} \|V_1(T)\|_{op} \vee \|V_2(T)\|_{op}.$$

This complexity term depends on both the sparsity and the rank of $A$. The rate of convergence
has the “expected” shape $(\log d)/T$, recalling that $T$ is the length of the observation interval
of the process, and these terms are balanced by the empirical variance terms coming out of the
new concentration results given below.


5 Data-driven matrix martingale Bernstein’s inequalities 

The proof of Theorem 1 requires a sharp control of the noise terms. Since we analyze both $\ell_1$
and trace-norm penalizations, we need control of this noise term for both the entrywise $\ell_\infty$ norm
and the operator norm $\|\cdot\|_{op}$. The concentration inequalities described below are of independent
interest. The noise term is the matrix martingale $Z(t)$ with entries

$$Z_{j,k}(t) = \int_0^t \int_{(0,s)} h_{j,k}(s - u)\,dN_k(u)\,dM_j(s), \qquad (15)$$

where $M_j(t) = N_j(t) - \int_0^t \lambda_j(s)\,ds$ are the martingales obtained by compensation of the Hawkes
process. A concentration inequality for $Z_{j,k}$ is easily obtained from Bernstein's inequality [24],
leading, for any $x > 0$, to

$$\frac{1}{t} Z_{j,k}(t) \le \sqrt{\frac{2 v x}{t}} + \frac{b x}{3 t}$$

with a probability larger than $1 - e^{-x}$ whenever

$$\frac{1}{t} \langle Z_{j,k} \rangle_t = \frac{1}{t} \int_0^t \Big( \int_{(0,s)} h_{j,k}(s - u)\,dN_k(u) \Big)^2 \lambda_j(s)\,ds \le v$$

and

$$\sup_{s \in [0,t]} \int_{(0,s)} h_{j,k}(s - u)\,dN_k(u) \le b.$$

A proof of this fact is implicit in the proof of Theorem 2 below. However, the predictable
variation $\langle Z_{j,k} \rangle_t$ depends on the non-observed intensity $\lambda_j$, so this inequality in its present form
is of no use for statistical learning. Moreover, this result requires knowing an upper bound on
$\langle Z_{j,k} \rangle_t$, while we would like an inequality that holds in general.
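
For intuition on the shape of this non-observable bound, the short sketch below evaluates its right-hand side $\sqrt{2vx/t} + bx/(3t)$ for given values of the variance bound $v$, the sup bound $b$, the confidence level $x$ and the horizon $t$. The function name is illustrative.

```python
import math


def bernstein_bound(v, b, x, t):
    """Right-hand side sqrt(2 v x / t) + b x / (3 t) of the Bernstein-type bound."""
    return math.sqrt(2.0 * v * x / t) + b * x / (3.0 * t)


# The variance term decays like 1/sqrt(t), the sup term like 1/t.
for t in (4.0, 16.0, 64.0):
    print(t, bernstein_bound(v=2.0, b=3.0, x=1.0, t=t))
```

For large $t$ the first term dominates, which is why a sharp, observable estimate of the variance matters most for tuning the penalizations.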

Hence, we need a new Bernstein-type inequality that uses an observable empirical variance
term, based on the optional variation, instead of the predictable variation. The optional
variation is given by $V_{j,k}(t)$, see Equation (7) above, and is understood as an estimation of $\langle Z_{j,k} \rangle_t$.
Let us consider also $B_{j,k}(t)$ given by (7) and $L_{x,j,k}(t)$ given by (8). The next theorem gives a
deviation bound on all the entries of $Z(t)$.

Theorem 2. We have 


for any $1 \le j,k \le d$, with a probability larger than $1 - 30.55 e^{-x}$.

The proof of Theorem 2 is given in Section 8.3 below, and has the same flavor as previous 
inequalities, see [13, 15]. 

Theorem 3 below gives a non-commutative version of Bernstein's inequality for the noise
term, namely a deviation for $\|Z(t)\|_{op}$. It is based on a concentration inequality by the same
authors [2], but it gives a bound with an observable variance term. We consider $H(t)$ given
by (7), $V_1(t)$ by (10), $V_2(t)$ by (11) and $\ell_x(t)$ by (12).

Theorem 3. For any x > 0, we have 


wzmop < {x+iogd+Lm\vi{t)\\opV\\v2{t)\\op 
t - ]l t 

^ (x -h logd-h 4(i))(10.34 -k 2.65supig[o^T] 11-^'(011 2 , 00 ) 
with a probability larger than $1 - 84.9 e^{-x}$.

The proof of Theorem 3 is given in Section 8.4. This result, of independent interest, gives a
control of the operator norm of the noise term with an observable variance term. This is the
first result of this kind to be found in the literature, alongside [2], which gives a first Bernstein
inequality for this kind of probabilistic object.

Once again, let us stress the fact that in both Theorems 2 and 3, all the quantities controlling 
the noise terms are observable, and are used for a sharp data-driven tuning of the penalizations 
considered in Section 3. 
6 Numerical experiments


In this section we conduct experiments on synthetic datasets to evaluate the performance of
our method, based on the proposed data-driven weighting of the penalizations, compared to
non-weighted penalizations [36]. We generate Hawkes processes using Ogata's thinning
algorithm [27], with d = 100, baselines $\mu_j$ sampled uniformly in [0, 0.1], exponential kernels
$h_{j,k}$ with decay coefficient $\alpha = 1$, and
an adjacency matrix containing square overlapping boxes (corresponding to overlapping
communities) at indexes 1:20, 10:50, 35:56 and 65:100. Each box is filled with values sampled
uniformly in [0, 0.2] and the rest of the matrix contains zeros. The matrix is then scaled to have
operator norm equal to 0.8, which guarantees that the process is stationary. An instance
of this matrix is given on the left side of Figure 2. We then run several procedures on the
generated data, restricted to growing intervals of length 1000, 2000, 3000, 4000 and 5000,
and assess their performance each time. The results are averaged over 10
separate simulations. Note that in this setting, the average number of events on an interval of
length 1000 is about 10000, and, by stationarity, grows linearly with the length
of the interval.

We consider a procedure based on minimization of the negative log-likelihood instead of
the least-squares used above to derive the theoretical results. This greatly reduces
computation times, as the computation of a gradient can be done in parallel and is linear in the
number of events and in the dimension, thanks to recursion formulas that can be used for exponential
decays, see [28]. It can be seen that the data-driven weights used in our penalizations are the
same when using the log-likelihood loss instead of the least-squares, as the noise term remains
the same. This objective is convex, with a goodness-of-fit term that is locally gradient-Lipschitz: we
use first-order optimization algorithms based on proximal accelerated gradient. Namely, we use
Fista [4] for problems with a single penalization on $A$ ($\ell_1$-norm or trace-norm) and Prisma [29]
for mixed $\ell_1$ and trace-norm penalizations on $A$. For both procedures we use a line-search scheme
that automatically tunes the gradient step at each iteration. For a fair comparison, we fix a
maximum of 100 iterations in all the results given below; we observed that this is largely
sufficient for convergence to a satisfactory minimum. Note that in [36] an ADMM algorithm
with Jensen's majorization-minimization principle is used, which is not accelerated, while our
algorithms are. We compare the following procedures:

• NoPen: direct minimization of the negative log-likelihood, with no penalization

• L1: non-weighted L1 penalization of $\mu$ and $A$

• wL1: weighted L1 penalization of $\mu$ and $A$ given by (16)

• L1Nuclear: non-weighted L1 penalization of $\mu$ and $A$, and trace-norm penalization of $A$

• wL1Nuclear: weighted L1 penalization of $\mu$ and $A$ given by (16) and trace-norm
penalization of $A$
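
The first-order solvers mentioned above rely on proximal operators of the penalization terms. As a rough illustration (this is not the authors' implementation, and it ignores any constraint on the parameters), the proximal operators of a weighted $\ell_1$ norm and of the trace norm are entrywise soft-thresholding and singular value soft-thresholding. The function names and the `step` argument are assumptions of this sketch.

```python
import numpy as np


def prox_weighted_l1(X, W, step):
    """Proximal operator of step * ||X||_{1,W}: entrywise soft-thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - step * W, 0.0)


def prox_trace_norm(X, tau, step):
    """Proximal operator of step * tau * ||X||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - step * tau, 0.0)
    return (U * s_shrunk) @ Vt  # rescale the columns of U, then recompose


X = np.array([[3.0, -1.0], [0.5, -2.0]])
print(prox_weighted_l1(X, np.ones((2, 2)), 1.0))
print(prox_trace_norm(np.diag([3.0, 1.0]), tau=1.0, step=2.0))
```

Larger weights shrink the corresponding entries more aggressively, which is how data-driven weights translate into entry-specific thresholds; the trace-norm step zeroes out small singular values and therefore lowers the rank.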

Note that the procedure L1Nuclear is the same as the one considered in [36]; however, we use a
different optimization algorithm, based on an accelerated first-order method (which we expect to
be faster than an ADMM-based algorithm, although a careful comparison of solvers is beyond
the scope of this paper). The data-driven weights used in our procedures are the ones derived
from our analysis, see (5) and (6), where we remove negligible terms and where we put $x = \log T$.
Namely, we use



and 



(16) 
Figure 2: Ground truth matrix A; recovered matrix using NoPen; L1; wL1; L1Nuclear;
wL1Nuclear. We observe that wL1 and wL1Nuclear lead to better support recovery, as we
observe fewer false positives outside of the node communities.





Figure 3: Error for L1 and wL1; error for L1Nuclear and wL1Nuclear; AUC for L1 and wL1.
The abscissa corresponds to the interval length T. Weighted penalizations systematically lead to
an improvement, both for L1 and L1 + Nuclear penalization, in terms of error and AUC.


for weighted $\ell_1$ penalization of $\mu$ and weighted $\ell_1$ penalization of $A$ respectively. The tuning
parameters $c_1$, $c_2$ and the parameter for trace-norm penalization of $A$ are tuned using
cross-validation, on a testing error measured by the log-likelihood computed on a held-out testing set
(we split the generated data in half for training and testing). We use two metrics to assess the
procedures:

• error: the relative $\ell_2$ estimation error of the parameter $\theta$, given by $\|\hat\theta - \theta\|_2^2 / \|\theta\|_2^2$

• AUC: we compute the AUC (area under the ROC curve) between the binarized ground
truth matrix $A$ and the solution $\hat A$ with entries scaled in [0, 1]. This quantifies
the ability of the procedure to detect the support of the connectivity structure between
nodes.
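
The AUC metric can be computed without any dedicated library through the rank-sum (Mann-Whitney) formulation. The sketch below is one possible implementation, with a hypothetical function name; since the AUC is invariant under increasing rescaling of the scores, scaling the entries of the solution to [0, 1] is not needed here.

```python
import numpy as np


def support_auc(A_true, A_hat):
    """AUC between the binarized ground truth (non-zero entries) and the estimated entries."""
    labels = (np.asarray(A_true) != 0).ravel()
    scores = np.asarray(A_hat, dtype=float).ravel()
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both zero and non-zero entries in A_true")
    # Average ranks (1-based): tied scores share the mean of the ranks they span.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    mean_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = mean_rank[inverse]
    # Mann-Whitney U statistic of the positive class, normalized to [0, 1].
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


A_true = np.array([[1.0, 0.0], [0.0, 1.0]])
A_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
print(support_auc(A_true, A_hat))
```

An AUC close to 1 means that entries inside the true support receive systematically larger estimated values than entries outside it, while 0.5 corresponds to an uninformative estimate.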

In Figures 2 and 3, we compare the procedures in terms of error and AUC. In Figure 2
we can observe, on an instance of the problem, the improvement of wL1 and wL1Nuclear
with respect to L1 and L1Nuclear respectively, as we observe fewer false positives outside the
node communities (better viewed on a computer). Figure 3 confirms the fact that weighted
penalizations systematically lead to an improvement, both for L1 and L1Nuclear, in terms of
error and AUC.
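
For reference, a ground-truth adjacency matrix of the kind used in this section can be generated along the following lines. This is only a sketch of the described setup: the translation of the index ranges 1:20, 10:50, 35:56 and 65:100 into 0-based slices, the handling of overlaps (later boxes overwrite earlier ones), the seed and the function name are assumptions.

```python
import numpy as np


def ground_truth_adjacency(d=100,
                           blocks=((0, 20), (9, 50), (34, 56), (64, 100)),
                           max_value=0.2, target_norm=0.8, seed=0):
    """Overlapping square boxes with uniform entries, rescaled to a given operator norm."""
    rng = np.random.default_rng(seed)
    A = np.zeros((d, d))
    for start, stop in blocks:
        size = stop - start
        # Fill the square box; overlapping parts are overwritten by later boxes.
        A[start:stop, start:stop] = rng.uniform(0.0, max_value, size=(size, size))
    # Rescale so that the largest singular value equals target_norm.
    A *= target_norm / np.linalg.norm(A, ord=2)
    return A


A = ground_truth_adjacency()
print(A.shape, np.linalg.norm(A, ord=2))
```

The resulting matrix is sparse outside the boxes and close to low-rank in structure, which is the regime targeted by the mixed penalization.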

7 Conclusion 

In this paper we proposed a careful analysis of the generalization error of an MHP-based
model of user interactions in a social network. Our theoretical analysis required a new
concentration inequality for matrix martingales in continuous time, with an observable variance
term, which is a result of independent interest. This analysis led to a new data-driven tuning
of sparsity-inducing penalizations, which we assess on a numerical example. Further work will
focus on other matrix factorization techniques for this problem, such as non-negative matrix
factorization, and on the use of text-mining techniques to incorporate content features, for Twitter
datasets for instance.
8 Proofs

8.1 Notations 

Denote by $X$ a $d \times d$ matrix and $x \in \mathbb{R}^d$. Then $\mathrm{diag}[x]$ stands for the diagonal matrix with
diagonal equal to $x$, while $\mathrm{diag}[X]$ stands for the diagonal matrix with diagonal equal to the
one of $X$. We write the singular value decomposition (SVD) of a rank $r$ matrix $X$ as

$$X = U \Sigma V^\top = \sum_{i=1}^r \sigma_i u_i v_i^\top,$$

where $\Sigma = \mathrm{diag}[\sigma(X)]$ with $\sigma(X) = [\sigma_1, \ldots, \sigma_r]^\top$ the vector of singular values $\sigma_1 \ge \cdots \ge \sigma_r > 0$ of
$X$ and where $U = [u_1 \cdots u_r]$ and $V = [v_1 \cdots v_r]$ are $d \times r$ matrices with columns given by the
left and right singular vectors of $X$. If $X$ and $Y$ are $d \times d$, we denote by $\langle X, Y \rangle = \mathrm{tr}(X^\top Y)$ the
Euclidean matrix product, and by $\|X\|_F = \sqrt{\langle X, X \rangle}$ the Frobenius norm. We introduce the
operator norm $\|X\|_{op} = \sigma_1(X)$ and the trace norm $\|X\|_* = \sum_{i=1}^r \sigma_i(X)$. If $W$ is a $d \times d$ matrix with
positive entries, we introduce the weighted entrywise $\ell_1$-norm given by $\|X\|_{1,W} = \langle W, |X| \rangle$,
where $|X|$ contains the absolute values of the entries of $X$. We denote by $\|X\|_0$ the number of
non-zero entries of $X$, and $X \odot Y$ is the entrywise product (Hadamard product) of $X$ and $Y$
with matching dimensions. We use the same notation $x \odot y$ for vectors $x$ and $y$ with matching
dimensions. We also denote by $X_{\bullet,j}$ the $j$-th column of $X$, while $X_{j,\bullet}$ stands for the $j$-th row.
We define

$$\|X\|_{2,\infty} = \max_j \|X_{j,\bullet}\|_2 \quad \text{and} \quad \|X\|_{\infty,2} = \max_j \|X_{\bullet,j}\|_2,$$

where $\|\cdot\|_2$ is the $\ell_2$ norm of vectors. If $X = U \Sigma V^\top$ is the SVD of $X$, the projection matrix
onto the space spanned by the columns (resp. rows) of $X$ is given by $P_U = U U^\top$ (resp.
$P_V = V V^\top$). The operator $\mathcal{P}_X : \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}$ given by $\mathcal{P}_X(Y) = P_U Y + Y P_V - P_U Y P_V$
is the projector onto the linear space spanned by the matrices $u_k x^\top$ and $y v_j^\top$ for $1 \le j,k \le d$ and
$x, y \in \mathbb{R}^d$. The projector onto the orthogonal space is given by $\mathcal{P}_X^\perp(Y) = (I - P_U) Y (I - P_V)$.
If $x$ is a vector (or matrix) then $\mathrm{supp}(x)$ stands for the support of $x$ (indices of non-zero entries)
and for another vector $x'$ the notation $(x')_{\mathrm{supp}(x)}$ stands for the vector with the same coordinates as $x'$
where we put 0 at indices outside of $\mathrm{supp}(x)$. We use the same notation $(X')_{\mathrm{supp}(X)}$ for matrices
$X'$ and $X$. We also use the notation $a \vee b = \max(a, b)$.

8.2 Proof of Theorem 1 

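The proof is based on the proof of a sharp oracle inequality for trace norm penalization, see [19]
and [18]. We endow the space $\mathbb{R}^d \times \mathbb{R}^{d \times d}$ with the inner product $\langle \theta, \theta' \rangle = \langle \mu, \mu' \rangle + \langle A, A' \rangle$, where
$\theta = (\mu, A)$ and $\theta' = (\mu', A')$ with $\langle \mu, \mu' \rangle = \mu^\top \mu'$ and $\langle A, A' \rangle = \mathrm{tr}(A^\top A')$.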
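For any $\theta$, one has

$$\langle \nabla R_T(\hat\theta), \hat\theta - \theta \rangle = \sum_{1 \le j \le d} (\hat\mu_j - \mu_j) \frac{\partial R_T(\hat\theta)}{\partial \mu_j} + \sum_{1 \le j,k \le d} (\hat a_{j,k} - a_{j,k}) \frac{\partial R_T(\hat\theta)}{\partial a_{j,k}}.$$

Since

$$\frac{\partial \lambda_{j,\theta}(t)}{\partial \mu_j} = 1 \quad \text{and} \quad \frac{\partial \lambda_{j,\theta}(t)}{\partial a_{j,k}} = \int_{(0,t)} h_{j,k}(t - s)\,dN_k(s),$$

we have that the derivatives of the empirical risk are given by

$$\frac{\partial R_T(\theta)}{\partial \mu_j} = \frac{2}{T} \Big( \int_0^T \lambda_{j,\theta}(t)\,dt - \int_0^T dN_j(t) \Big)$$

and

$$\frac{\partial R_T(\theta)}{\partial a_{j,k}} = \frac{2}{T} \Big( \int_0^T \int_{(0,t)} h_{j,k}(t - s)\,dN_k(s)\,\lambda_{j,\theta}(t)\,dt - \int_0^T \int_{(0,t)} h_{j,k}(t - s)\,dN_k(s)\,dN_j(t) \Big).$$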


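Since $\lambda_{j,\hat\theta}(t) - \lambda_{j,\theta}(t) = (\hat\mu_j - \mu_j) + \sum_{k=1}^d (\hat a_{j,k} - a_{j,k}) \int_{(0,t)} h_{j,k}(t - s)\,dN_k(s)$, this leads to

$$\langle \nabla R_T(\hat\theta), \hat\theta - \theta \rangle = \frac{2}{T} \sum_{j=1}^d \int_0^T \big( \lambda_{j,\hat\theta}(t) - \lambda_{j,\theta}(t) \big) \big( \lambda_{j,\hat\theta}(t)\,dt - dN_j(t) \big).$$

Using $dM_j(t) = dN_j(t) - \lambda_j(t)\,dt$ and recalling that

$$\langle f, g \rangle_T = \frac{1}{T} \sum_{j=1}^d \int_0^T f_j(t)\,g_j(t)\,dt,$$

we obtain the decomposition

$$\langle \nabla R_T(\hat\theta), \hat\theta - \theta \rangle = 2 \langle \lambda_{\hat\theta} - \lambda_\theta, \lambda_{\hat\theta} - \lambda \rangle_T - \frac{2}{T} \sum_{j=1}^d \int_0^T \big( \lambda_{j,\hat\theta}(t) - \lambda_{j,\theta}(t) \big)\,dM_j(t).$$

Namely, we end up with

$$2 \langle \lambda_{\hat\theta} - \lambda_\theta, \lambda_{\hat\theta} - \lambda \rangle_T = \langle \nabla R_T(\hat\theta), \hat\theta - \theta \rangle + \frac{2}{T} \sum_{j=1}^d \int_0^T \big( \lambda_{j,\hat\theta}(t) - \lambda_{j,\theta}(t) \big)\,dM_j(t). \qquad (17)$$

The parallelogram identity gives

$$2 \langle \lambda_{\hat\theta} - \lambda_\theta, \lambda_{\hat\theta} - \lambda \rangle_T = \|\lambda_{\hat\theta} - \lambda\|_T^2 + \|\lambda_{\hat\theta} - \lambda_\theta\|_T^2 - \|\lambda_\theta - \lambda\|_T^2,$$

where we put $\|f\|_T^2 = \langle f, f \rangle_T$.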
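Let us point out that, in the case $\langle \lambda_{\hat\theta} - \lambda_\theta, \lambda_{\hat\theta} - \lambda \rangle_T < 0$, one obtains

$$\|\lambda_{\hat\theta} - \lambda\|_T^2 \le \|\lambda_\theta - \lambda\|_T^2,$$

which directly implies the inequality of the Theorem.

Thus, from now on, let us assume that

$$\langle \lambda_{\hat\theta} - \lambda_\theta, \lambda_{\hat\theta} - \lambda \rangle_T \ge 0. \qquad (18)$$

The first order condition for $\hat\theta \in \operatorname{argmin}_\theta \{ R_T(\theta) + \mathrm{pen}(\theta) \}$ gives

$$-\nabla R_T(\hat\theta) \in \partial\,\mathrm{pen}(\hat\theta).$$

Let $\hat\theta_\partial = -\nabla R_T(\hat\theta)$. Since the subdifferential is a monotone mapping, we have $\langle \hat\theta - \theta, \hat\theta_\partial - \theta_\partial \rangle \ge 0$
for any $\theta_\partial \in \partial\,\mathrm{pen}(\theta)$. Thus from (17), one gets, for all $\theta_\partial \in \partial\,\mathrm{pen}(\theta)$,

$$2 \langle \lambda_{\hat\theta} - \lambda_\theta, \lambda_{\hat\theta} - \lambda \rangle_T \le -\langle \theta_\partial, \hat\theta - \theta \rangle + \frac{2}{T} \sum_{j=1}^d \int_0^T \big( \lambda_{j,\hat\theta}(t) - \lambda_{j,\theta}(t) \big)\,dM_j(t). \qquad (19)$$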

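We now need to characterize the structure of the subdifferentials involved in $\mathrm{pen}(\theta)$, to describe
$\theta_\partial$.

If $g_1(\mu) = \sum_{j=1}^d \hat w_j |\mu_j|$ for $\hat w_j \ge 0$, we have

$$\partial g_1(\mu) = \big\{ \hat w \odot \mathrm{sign}(\mu) + \hat w \odot f : \|f\|_\infty \le 1,\ \mu \odot f = 0 \big\}. \qquad (20)$$

If $g_2(A) = \sum_{1 \le j,k \le d} \hat W_{j,k} |a_{j,k}|$, for $\hat W_{j,k} \ge 0$, we have

$$\partial g_2(A) = \big\{ \hat W \odot \mathrm{sign}(A) + \hat W \odot F : \|F\|_\infty \le 1,\ A \odot F = 0 \big\}. \qquad (21)$$

Let us recall that if $A = U \Sigma V^\top$ is the SVD of $A$, we have $\mathcal{P}_A(B) = P_U B + B P_V - P_U B P_V$
and $\mathcal{P}_A^\perp(B) = (I - P_U) B (I - P_V)$ (projection onto the column and row space of $A$ and
projection onto its orthogonal space). Now, for $g_3(A) = \hat\tau \|A\|_*$, we have

$$\partial g_3(A) = \big\{ \hat\tau U V^\top + \hat\tau \mathcal{P}_A^\perp(F) : \|F\|_{op} \le 1 \big\}, \qquad (22)$$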

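see for instance [22]. Now, write

$$\langle \theta_\partial, \hat\theta - \theta \rangle = \langle \mu_\partial, \hat\mu - \mu \rangle + \langle A_{\partial,1}, \hat A - A \rangle + \langle A_{\partial,*}, \hat A - A \rangle$$

with $\mu_\partial \in \partial g_1(\mu)$, $A_{\partial,1} \in \partial g_2(A)$ and $A_{\partial,*} \in \partial g_3(A)$. Using Equations (20), (21) and (22), we can
write

$$-\langle \theta_\partial, \hat\theta - \theta \rangle = -\langle \hat w \odot \mathrm{sign}(\mu), \hat\mu - \mu \rangle - \langle \hat w \odot f, \hat\mu - \mu \rangle - \langle \hat W \odot \mathrm{sign}(A), \hat A - A \rangle - \langle \hat W \odot F_1, \hat A - A \rangle - \hat\tau \langle U V^\top, \hat A - A \rangle - \hat\tau \langle F_*, \mathcal{P}_A^\perp(\hat A - A) \rangle,$$

where by duality between the norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$, and between $\|\cdot\|_*$ and $\|\cdot\|_{op}$, we can
choose $f$, $F_1$ and $F_*$ such that

$$\langle \hat w \odot f, \hat\mu - \mu \rangle = \|(\hat\mu - \mu)_{\mathrm{supp}(\mu)^\perp}\|_{1,\hat w}, \quad \langle \hat W \odot F_1, \hat A - A \rangle = \|(\hat A - A)_{\mathrm{supp}(A)^\perp}\|_{1,\hat W}$$

and

$$\langle F_*, \mathcal{P}_A^\perp(\hat A - A) \rangle = \|\mathcal{P}_A^\perp(\hat A - A)\|_*,$$

which leads to

$$-\langle \theta_\partial, \hat\theta - \theta \rangle \le \|(\hat\mu - \mu)_{\mathrm{supp}(\mu)}\|_{1,\hat w} - \|(\hat\mu - \mu)_{\mathrm{supp}(\mu)^\perp}\|_{1,\hat w} + \|(\hat A - A)_{\mathrm{supp}(A)}\|_{1,\hat W} - \|(\hat A - A)_{\mathrm{supp}(A)^\perp}\|_{1,\hat W} + \hat\tau \|\mathcal{P}_A(\hat A - A)\|_* - \hat\tau \|\mathcal{P}_A^\perp(\hat A - A)\|_*.$$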

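Now, decompose the noise term of (19):

$$\begin{aligned}
\frac{2}{T}\sum_{j=1}^d \int_0^T \big(\lambda_{j,\hat\theta}(t) - \lambda_{j,\theta}(t)\big)\,dM_j(t)
= {} & \frac{2}{T}\sum_{j=1}^d (\hat\mu_j - \mu_j) \int_0^T dM_j(t) \\
& + \frac{2}{T}\sum_{1 \le j,k \le d} (\hat A_{j,k} - A_{j,k}) \int_0^T \int_{(0,t)} h_{j,k}(t - s)\,dN_k(s)\,dM_j(t) \\
= {} & \frac{2}{T}\langle \hat\mu - \mu, M_T\rangle + \frac{2}{T}\langle \hat A - A, Z\rangle,
\end{aligned}$$

where

$$M_T = \Big[\int_0^T dM_1(t), \ldots, \int_0^T dM_d(t)\Big]^\top$$

and where we recall that $Z$ is given by (15), see Section 5. We have

$$|\langle \hat\mu - \mu, M_T\rangle| \le \sum_{j=1}^d |\hat\mu_j - \mu_j|\,|M_j([0,T])|, \qquad |\langle \hat A - A, Z\rangle| \le \sum_{1 \le j,k \le d} |\hat A_{j,k} - A_{j,k}|\,|Z_{j,k}|$$

and

$$|\langle \hat A - A, Z\rangle| \le \|Z\|_{\mathrm{op}} \|\hat A - A\|_*,$$

where we used again duality between trace norm and operator norm.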

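We need now to use the concentration inequalities given in Section 5, that are proved in Sections 8.3 and 8.4 below. Using Theorem 2 (see Section 8.3 below) with $h = 1$ and a union bound on $j = 1, \ldots, d$ gives that

$$\frac{1}{T}|M_j([0,T])| \le 2\sqrt{2}\sqrt{\frac{(x + \log d + \ell_{x,j}(T))\,N_j([0,T])/T}{T}} + 9.31\,\frac{x + \log d + \ell_{x,j}(T)}{T}$$

for any $1 \le j \le d$ with a probability larger than $1 - 30.55e^{-x}$. Using Theorem 2 from Section 5 entails that

$$\frac{1}{T}|Z_{j,k}| \le 2\sqrt{2}\sqrt{\frac{(x + 2\log d + \ell_{x,j,k}(T))\,V_{j,k}(T)}{T}} + 9.31\,\frac{(x + 2\log d + \ell_{x,j,k}(T))\,B_{j,k}(T)}{T}$$

for any $1 \le j,k \le d$ with a probability larger than $1 - 30.55e^{-x}$, and finally Theorem 3 (see Section 5) entails that

$$\frac{\|Z(T)\|_{\mathrm{op}}}{T} \le 4\sqrt{\frac{(x + \log d + \ell_x(T))\,\|V_1(T)\|_{\mathrm{op}} \vee \|V_2(T)\|_{\mathrm{op}}}{T}} + \frac{(x + \log d + \ell_x(T))\big(10.34 + 2.65 \sup_{t \in [0,T]} \|H(t)\|_{2,\infty}\big)}{T}$$

with a probability larger than $1 - 84.9e^{-x}$. Hence, the choice of weights (5), (6) and (9) entails

$$\frac{1}{T}|\langle \hat\mu - \mu, M_T\rangle| \le \frac{1}{3}\|\hat\mu - \mu\|_{1,\hat w}, \qquad \frac{1}{T}|\langle \hat A - A, Z\rangle| \le \frac{1}{2}\|\hat A - A\|_{1,\hat W}$$

and

$$\frac{1}{T}|\langle \hat A - A, Z\rangle| \le \frac{\hat\tau}{2}\|\hat A - A\|_*$$

on an event with a probability larger than $1 - 146e^{-x}$. This entails

$$\begin{aligned}
0 \le {} & -\langle \theta_\partial, \hat\theta - \theta\rangle + \frac{2}{T}\sum_{j=1}^d \int_0^T \big(\lambda_{j,\hat\theta}(t) - \lambda_{j,\theta}(t)\big)\,dM_j(t) \\
\le {} & \frac{5}{3}\|(\hat\mu - \mu)_{\mathrm{supp}(\mu)}\|_{1,\hat w} - \frac{1}{3}\|(\hat\mu - \mu)_{\mathrm{supp}(\mu)^\perp}\|_{1,\hat w} \\
& + \frac{3}{2}\|(\hat A - A)_{\mathrm{supp}(A)}\|_{1,\hat W} - \frac{1}{2}\|(\hat A - A)_{\mathrm{supp}(A)^\perp}\|_{1,\hat W} \\
& + \frac{3}{2}\hat\tau\|P_A(\hat A - A)\|_* - \frac{1}{2}\hat\tau\|P_A^\perp(\hat A - A)\|_*.
\end{aligned}$$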

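Taking $A = \hat A$ gives a cone constraint on $\hat\mu - \mu$:

$$\|(\hat\mu - \mu)_{\mathrm{supp}(\mu)^\perp}\|_{1,\hat w} \le 5\,\|(\hat\mu - \mu)_{\mathrm{supp}(\mu)}\|_{1,\hat w},$$

while taking $\mu = \hat\mu$ gives a cone constraint on $\hat A - A$:

$$\|(\hat A - A)_{\mathrm{supp}(A)^\perp}\|_{1,\hat W} + \hat\tau\|P_A^\perp(\hat A - A)\|_* \le 3\,\|(\hat A - A)_{\mathrm{supp}(A)}\|_{1,\hat W} + 3\hat\tau\|P_A(\hat A - A)\|_*.$$

Namely, we have now using Assumption 1 that

$$\|(\hat\mu - \mu)_{\mathrm{supp}(\mu)}\|_2 \vee \|(\hat A - A)_{\mathrm{supp}(A)}\|_F \vee \|P_A(\hat A - A)\|_F \le \kappa(\theta)\,\|\lambda_{\hat\theta} - \lambda_\theta\|_T.$$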
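Putting all this together gives

$$\begin{aligned}
& -\langle \theta_\partial, \hat\theta - \theta\rangle + \frac{2}{T}\langle \hat\mu - \mu, M_T\rangle + \frac{2}{T}\langle \hat A - A, Z\rangle \\
& \quad \le \frac{5}{3}\|(\hat\mu - \mu)_{\mathrm{supp}(\mu)}\|_{1,\hat w} - \frac{1}{3}\|(\hat\mu - \mu)_{\mathrm{supp}(\mu)^\perp}\|_{1,\hat w} \\
& \qquad + \frac{3}{2}\|(\hat A - A)_{\mathrm{supp}(A)}\|_{1,\hat W} - \frac{1}{2}\|(\hat A - A)_{\mathrm{supp}(A)^\perp}\|_{1,\hat W} \\
& \qquad + \frac{3}{2}\hat\tau\|P_A(\hat A - A)\|_* - \frac{1}{2}\hat\tau\|P_A^\perp(\hat A - A)\|_* \\
& \quad \le \frac{5}{3}\|(\hat w)_{\mathrm{supp}(\mu)}\|_2 \|(\hat\mu - \mu)_{\mathrm{supp}(\mu)}\|_2 + \frac{3}{2}\|(\hat W)_{\mathrm{supp}(A)}\|_F \|(\hat A - A)_{\mathrm{supp}(A)}\|_F \\
& \qquad + \frac{3}{2}\hat\tau\sqrt{\mathrm{rank}(A)}\,\|P_A(\hat A - A)\|_F,
\end{aligned} \tag{23}$$

where we used Cauchy-Schwarz's inequality. This finally gives

$$\|\lambda_{\hat\theta} - \lambda\|_T^2 \le \|\lambda_\theta - \lambda\|_T^2 - \|\lambda_{\hat\theta} - \lambda_\theta\|_T^2 + \kappa(\theta)\Big(\frac{5}{3}\|(\hat w)_{\mathrm{supp}(\mu)}\|_2 + \frac{3}{2}\|(\hat W)_{\mathrm{supp}(A)}\|_F + \frac{3}{2}\hat\tau\sqrt{\mathrm{rank}(A)}\Big)\|\lambda_{\hat\theta} - \lambda_\theta\|_T,$$

where we used (23). The conclusion of the proof of Theorem 1 follows from the fact that $ax - x^2 \le a^2/4$ for any $a, x > 0$.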


8.3 Proof of Theorem 2 

We want to control all the entries of the matrix $Z$ given by

$$Z_{j,k}(t) = \int_0^t \int_{(0,s)} h_{j,k}(s - u)\,dN_k(u)\,dM_j(s).$$

We use the next Theorem, which is based on Theorem 3 from [13].

Theorem 4. Let $N$ be a counting process with predictable intensity $\lambda$ and compensator $\Lambda$. Let $g$ be a predictable function bounded a.s. Put $M = N - \Lambda$ the martingale obtained by compensation. Then, the martingale given by

$$Z_t = \int_0^t g(s)\,dM(s)$$

satisfies

$$|Z_t| \le 2\sqrt{2}\sqrt{(x + \ell_x(t))\,[Z]_t} + 9.31\,(x + \ell_x(t))\,\|g\|_\infty$$

with a probability larger than $1 - 30.55e^{-x}$, for $\|g\|_\infty = \sup_{s \in [0,t]} |g(s)|$, the optional variation

$$[Z]_t = \int_0^t g(s)^2\,dN(s),$$

and for

4(t) = 21oglog(5i!3gt45;P^vel. 


112x||5|| 


2 

oo 


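We fix $(j,k) \in \{1, \ldots, d\}^2$ and choose $M = M_j$, $N = N_j$ and

$$g(t) = \int_{(0,t)} h_{j,k}(t - s)\,dN_k(s)$$

in Theorem 4, which leads to

$$\frac{1}{T}|Z_{j,k}(T)| \le 2\sqrt{2}\sqrt{\frac{(x + \ell_{x,j,k}(T))\,V_{j,k}(T)}{T}} + 9.31\,\frac{(x + \ell_{x,j,k}(T))\,B_{j,k}(T)}{T}$$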


with a probability larger than $1 - 30.55e^{-x}$ for any $j,k$. Now, using a union bound over $(j,k) \in \{1, \ldots, d\}^2$ with this inequality gives the same inequality with the same probability, where we increase $x$ by $2\log d$. □
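The role of the shift $x \mapsto x + 2\log d$ can be made explicit: each of the $d^2$ entries fails with probability at most $30.55\,e^{-(x + 2\log d)}$, and summing over entries recovers $30.55\,e^{-x}$. A minimal numerical sketch (the function name and the test values of `d` and `x` are illustrative assumptions):

```python
import math

def union_bound(d, x, c=30.55):
    """Sum of d*d failure probabilities, each at most c * exp(-(x + 2 log d))."""
    return d * d * c * math.exp(-(x + 2.0 * math.log(d)))

# The factor d^2 is exactly absorbed by the shift of x.
print(math.isclose(union_bound(10, 3.0), 30.55 * math.exp(-3.0), rel_tol=1e-9))
```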
8.4 Proof of Theorem 3
8.4.1 Notations 

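Let us introduce

$$D_X^{(1)}(t) = \mathrm{diag}[X(t) X(t)^\top] \quad \text{and} \quad D_X^{(2)}(t) = \mathrm{diag}[X(t)^\top X(t)], \tag{24}$$

where we note that

$$D_X^{(1)}(t) = \mathrm{diag}\big[\|X_{1,\bullet}(t)\|_2^2, \ldots, \|X_{d,\bullet}(t)\|_2^2\big] \quad \text{and} \quad D_X^{(2)}(t) = \mathrm{diag}\big[\|X_{\bullet,1}(t)\|_2^2, \ldots, \|X_{\bullet,d}(t)\|_2^2\big]. \tag{25}$$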


8.4.2 Preliminary results 

Let us introduce a process of the form

$$U_{A,B}(t) = \int_0^t A_s\,\mathrm{diag}[dM_s]\,B_s, \tag{26}$$

where $\{A_t\}_{t \ge 0}$ and $\{B_t\}_{t \ge 0}$ are arbitrary $(\mathcal{F}_t)$-predictable $d \times d$ processes, so that the entries of $U_{A,B}(t)$ are given by

$$(U_{A,B}(t))_{i,j} = \sum_{k=1}^d \int_0^t (A_s)_{i,k} (B_s)_{k,j}\,(dM_s)_k.$$

Recalling that $H_t$ is the matrix with entries $H_{j,k}(t) = \int_{(0,t)} h_{j,k}(t - s)\,dN_k(s)$, see (7), and that

$$Z_{j,k}(t) = \int_0^t \int_{(0,s)} h_{j,k}(s - u)\,dN_k(u)\,dM_j(s),$$

we have

$$Z_t = \int_0^t \mathrm{diag}[dM_s]\,H_s = U_{I,H}(t).$$

Let us recall that we want to control $\|Z_t\|_{\mathrm{op}}$. The next Theorem is given in [2], see Theorem 4 herein, and is a core ingredient for the proof of Theorem 3.


Theorem 5. Let $U_{A,B}(t)$ be given by (26) with $M_j(t) = N_j(t) - \int_0^t \lambda_j(s)\,ds$ that are martingales obtained by compensation of the counting processes $N_j$ for $j = 1, \ldots, d$. Define the matrix

$$V_{A,B,\lambda}(t) = \int_0^t \|A(s)\|_{\infty,2}^2\,\|B(s)\|_{2,\infty}^2\,W_{A,B,\lambda}(s)\,ds,$$

where

$$W_{A,B,\lambda}(t) = \begin{bmatrix} A_t\,\mathrm{diag}[A_t^\top A_t]^{-1}\,\mathrm{diag}[\lambda_t]\,A_t^\top & 0 \\ 0 & B_t^\top\,\mathrm{diag}[B_t B_t^\top]^{-1}\,\mathrm{diag}[\lambda_t]\,B_t \end{bmatrix}, \tag{27}$$

and introduce also

$$b_{A,B}(t) = \sup_{s \in [0,t]} \|A(s)\|_{\infty,2}\,\|B(s)\|_{2,\infty}.$$

Then, for any $v, x, b > 0$, the following holds:

$$\mathbb{P}\Big[\|U_{A,B}(t)\|_{\mathrm{op}} \ge \sqrt{2vx} + \frac{bx}{3},\ b_{A,B}(t) \le b,\ \lambda_{\max}(V_{A,B,\lambda}(t)) \le v\Big] \le 2d\,e^{-x}. \tag{28}$$

An immediate corollary of Theorem 5 is given in Corollary 1 below. For $0 \le v_1 < v_2$ and $0 \le b_1 < b_2$, we introduce the events

$$\mathcal{V}_{v_1,v_2} = \{v_1 < \|V_{A,B,\lambda}(t)\|_{\mathrm{op}} \le v_2\} \quad \text{and} \quad \mathcal{B}_{b_1,b_2} = \{b_1 < b_{A,B}(t) \le b_2\}.$$

These events give lower and upper bounds of the random quantities involved in this theorem. A peeling argument, given below, will allow us to remove these events from the concentration, by slightly enlarging the concentration bound by a poly-logarithmic factor.

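Corollary 1. Fix any $\epsilon, b, v > 0$ and $x > 0$. The following deviation inequalities hold.

$$\mathbb{P}\Big[\|U_{A,B}(t)\|_{\mathrm{op}} \ge \sqrt{2vx} + \frac{bx}{3},\ \mathcal{V}_{0,v} \cap \mathcal{B}_{0,b}\Big] \le 2d\,e^{-x}, \tag{29}$$

$$\mathbb{P}\Big[\|U_{A,B}(t)\|_{\mathrm{op}} \ge \sqrt{2(1+\epsilon)\|V_{A,B,\lambda}(t)\|_{\mathrm{op}}\,x} + \frac{bx}{3},\ \mathcal{V}_{v,(1+\epsilon)v} \cap \mathcal{B}_{0,b}\Big] \le 2d\,e^{-x}, \tag{30}$$

$$\mathbb{P}\Big[\|U_{A,B}(t)\|_{\mathrm{op}} \ge \sqrt{2(1+\epsilon)vx} + \frac{b_{A,B}(t)\,x}{3},\ \mathcal{V}_{0,v} \cap \mathcal{B}_{b,(1+\epsilon)b}\Big] \le 2d\,e^{-x}, \tag{31}$$

$$\mathbb{P}\Big[\|U_{A,B}(t)\|_{\mathrm{op}} \ge (1+\epsilon)\sqrt{2\|V_{A,B,\lambda}(t)\|_{\mathrm{op}}\,x} + \frac{b_{A,B}(t)\,x}{3},\ \mathcal{V}_{v,(1+\epsilon)v} \cap \mathcal{B}_{b,(1+\epsilon)b}\Big] \le 2d\,e^{-x}. \tag{32}$$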


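8.4.3 First concentration inequalities for $\|Z_t\|_{\mathrm{op}}$

As explained above, if we choose $A = I$ and $B = H$, we have $U_{A,B}(t) = Z_t$. For this choice, we have

$$b_{I,H}(t) = b_H(t) = \sup_{s \in [0,t]} \|H(s)\|_{2,\infty}$$

and

$$V_{I,H,\lambda}(t) = \begin{bmatrix} V_{H,\lambda,1}(t) & 0 \\ 0 & V_{H,\lambda,2}(t) \end{bmatrix},$$

where we defined

$$V_{H,\lambda,1}(t) = \int_0^t \|H(s)\|_{2,\infty}^2\,D_\lambda(s)\,ds,$$

$$V_{H,\lambda,2}(t) = \int_0^t \|H(s)\|_{2,\infty}^2\,H(s)^\top D_H^{(1)}(s)^{-1} D_\lambda(s)\,H(s)\,ds.$$

Note that

$$\|V_{I,H,\lambda}(t)\|_{\mathrm{op}} = \|V_{H,\lambda,1}(t)\|_{\mathrm{op}} \vee \|V_{H,\lambda,2}(t)\|_{\mathrm{op}}.$$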

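We will denote for short ($t$ is fixed throughout)

$$Z = \|Z_t\|_{\mathrm{op}}, \quad V_1 = \|V_{H,\lambda,1}(t)\|_{\mathrm{op}}, \quad V_2 = \|V_{H,\lambda,2}(t)\|_{\mathrm{op}}, \quad B = b_H(t)$$

until the end of the proof. Introduce also, for $v_1, v_2, b_1, b_2 \ge 0$, the events

$$\mathcal{V}^{(1)}_{v_1,v_2} = \{v_1 < V_1 \le v_2\}, \quad \mathcal{V}^{(2)}_{v_1,v_2} = \{v_1 < V_2 \le v_2\}, \quad \mathcal{B}_{b_1,b_2} = \{b_1 < B \le b_2\}.$$

Corollary 1 entails the following inequalities.

$$\mathbb{P}\Big[Z \ge \sqrt{2vx} + \frac{bx}{3},\ \mathcal{V}^{(1)}_{0,v} \cap \mathcal{V}^{(2)}_{0,v} \cap \mathcal{B}_{0,b}\Big] \le 2d\,e^{-x}, \tag{33}$$

$$\mathbb{P}\Big[Z \ge \sqrt{2(1+\epsilon)(V_1 \vee v_2)x} + \frac{bx}{3},\ \mathcal{V}^{(1)}_{v_1,(1+\epsilon)v_1} \cap \mathcal{V}^{(2)}_{0,v_2} \cap \mathcal{B}_{0,b}\Big] \le 2d\,e^{-x}, \tag{34}$$

$$\mathbb{P}\Big[Z \ge \sqrt{2(1+\epsilon)(v_1 \vee V_2)x} + \frac{bx}{3},\ \mathcal{V}^{(1)}_{0,v_1} \cap \mathcal{V}^{(2)}_{v_2,(1+\epsilon)v_2} \cap \mathcal{B}_{0,b}\Big] \le 2d\,e^{-x}, \tag{35}$$

$$\mathbb{P}\Big[Z \ge \sqrt{2(1+\epsilon)(V_1 \vee V_2)x} + \frac{bx}{3},\ \mathcal{V}^{(1)}_{v_1,(1+\epsilon)v_1} \cap \mathcal{V}^{(2)}_{v_2,(1+\epsilon)v_2} \cap \mathcal{B}_{0,b}\Big] \le 2d\,e^{-x}, \tag{36}$$

$$\mathbb{P}\Big[Z \ge \sqrt{2(1+\epsilon)vx} + \frac{Bx}{3},\ \mathcal{V}^{(1)}_{0,v} \cap \mathcal{V}^{(2)}_{0,v} \cap \mathcal{B}_{b,(1+\epsilon)b}\Big] \le 2d\,e^{-x}, \tag{37}$$

$$\mathbb{P}\Big[Z \ge (1+\epsilon)\sqrt{2(V_1 \vee v_2)x} + \frac{Bx}{3},\ \mathcal{V}^{(1)}_{v_1,(1+\epsilon)v_1} \cap \mathcal{V}^{(2)}_{0,v_2} \cap \mathcal{B}_{b,(1+\epsilon)b}\Big] \le 2d\,e^{-x}, \tag{38}$$

$$\mathbb{P}\Big[Z \ge (1+\epsilon)\sqrt{2(v_1 \vee V_2)x} + \frac{Bx}{3},\ \mathcal{V}^{(1)}_{0,v_1} \cap \mathcal{V}^{(2)}_{v_2,(1+\epsilon)v_2} \cap \mathcal{B}_{b,(1+\epsilon)b}\Big] \le 2d\,e^{-x}, \tag{39}$$

$$\mathbb{P}\Big[Z \ge (1+\epsilon)\sqrt{2(V_1 \vee V_2)x} + \frac{Bx}{3},\ \mathcal{V}^{(1)}_{v_1,(1+\epsilon)v_1} \cap \mathcal{V}^{(2)}_{v_2,(1+\epsilon)v_2} \cap \mathcal{B}_{b,(1+\epsilon)b}\Big] \le 2d\,e^{-x}. \tag{40}$$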

These inequalities look like the desired Bernstein's inequality, but they have two major problems. First, the variance terms $V_1$ and $V_2$ depend on $\lambda$, hence are non-observable. Second, we need to remove the events $\mathcal{V}^{(1)}$, $\mathcal{V}^{(2)}$ and $\mathcal{B}$ to end up with a usable inequality. Natural estimators of $V_{H,\lambda,1}(t)$ and $V_{H,\lambda,2}(t)$ are given, respectively, by

$$\hat V_{H,1}(t) = \int_0^t \|H(s)\|_{2,\infty}^2\,\mathrm{diag}[dN_s],$$

$$\hat V_{H,2}(t) = \int_0^t \|H(s)\|_{2,\infty}^2\,H(s)^\top D_H^{(1)}(s)^{-1}\,\mathrm{diag}[dN_s]\,H(s).$$

We introduce for short

$$\hat V_1 = \|\hat V_{H,1}(t)\|_{\mathrm{op}}, \qquad \hat V_2 = \|\hat V_{H,2}(t)\|_{\mathrm{op}}.$$

The next step is to prove that we can replace the non-observable variance terms $V_1$ and $V_2$, that involve the quadratic variation, by the observable variance terms $\hat V_1$ and $\hat V_2$ involving the optional variation.


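8.4.4 Replacing $V_2$ by $\hat V_2$

First, note that

$$\begin{aligned}
V_{H,\lambda,2}(t) &= \int_0^t \|H(s)\|_{2,\infty}^2\,H(s)^\top D_H^{(1)}(s)^{-1} D_\lambda(s)\,H(s)\,ds \\
&= \hat V_{H,2}(t) - \int_0^t \|H(s)\|_{2,\infty}^2\,H(s)^\top D_H^{(1)}(s)^{-1}\,\mathrm{diag}[dM_s]\,H(s) \\
&= \hat V_{H,2}(t) - U_{Q^\top,Q}(t),
\end{aligned}$$

where

$$Q(t) = \|H(t)\|_{2,\infty}\,D_H^{(1)}(t)^{-1/2} H(t).$$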

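Hence, we use again Proposition 1 with $A = Q^\top$ and $B = Q$. Note that

$$b_{Q^\top,Q}(t) = \sup_{s \in [0,t]} \|Q^\top(s)\|_{\infty,2}\,\|Q(s)\|_{2,\infty} = \sup_{s \in [0,t]} \|Q(s)\|_{2,\infty}^2 = \sup_{s \in [0,t]} \|H(s)\|_{2,\infty}^2\,\|D_H^{(1)}(s)^{-1/2} H(s)\|_{2,\infty}^2,$$

but using (25) gives

$$\|D_H^{(1)}(t)^{-1/2} H(t)\|_{2,\infty} = \max_j \big\|\big(D_H^{(1)}(t)^{-1/2} H(t)\big)_{j,\bullet}\big\|_2 = \max_j \big\|\,\|H_{j,\bullet}(t)\|_2^{-1} H_{j,\bullet}(t)\big\|_2 = 1,$$

so that

$$b_{Q^\top,Q}(t) = b_H(t)^2 = \sup_{s \in [0,t]} \|H(s)\|_{2,\infty}^2.$$

Moreover, recalling (??) and (27), this means that

$$V_{Q^\top,Q,\lambda}(t) = \begin{bmatrix} V_{H,\lambda,2}(t) & 0 \\ 0 & V_{H,\lambda,2}(t) \end{bmatrix}.$$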

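Hence, Proposition 1 gives, for the choice $A = Q^\top$ and $B = Q$,

$$\mathbb{P}\Big[\|U_{Q^\top,Q}(t)\|_{\mathrm{op}} \ge (1+\epsilon)\sqrt{2\|V_{H,\lambda,2}(t)\|_{\mathrm{op}}\,x} + \frac{b_H(t)^2}{3}x,\ v < \|V_{H,\lambda,2}(t)\|_{\mathrm{op}} \le (1+\epsilon)v,\ b^2 < b_H(t)^2 \le (1+\epsilon)b^2\Big] \le 2d\,e^{-x}.$$

We obtain that with a probability larger than $1 - 2d\,e^{-x}$ we have

$$\|V_{H,\lambda,2}(t) - \hat V_{H,2}(t)\|_{\mathrm{op}} \le (1+\epsilon)\sqrt{2\|V_{H,\lambda,2}(t)\|_{\mathrm{op}}\,x} + \frac{b_H(t)^2}{3}x$$

on the event $\{v < \|V_{H,\lambda,2}(t)\|_{\mathrm{op}} \le (1+\epsilon)v\} \cap \{b^2 < b_H(t)^2 \le (1+\epsilon)b^2\}$. So, using the so-called "square-root trick", namely the fact that $A \le b + \sqrt{aA}$ entails $A \le a + 2b$ for any $a, A, b > 0$, we obtain

$$V_2 \le 2\hat V_2 + 2\big((1+\epsilon)^2 + b_H(t)^2/3\big)x \tag{41}$$

and

$$\hat V_2 \le 2V_2 + \big((1+\epsilon)^2/2 + b_H(t)^2/3\big)x \tag{42}$$

on this event, with a probability larger than $1 - 2d\,e^{-x}$.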
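The square-root trick invoked above can be verified directly. Writing $s = \sqrt{A}$, the constraint $A \le b + \sqrt{aA}$ reads $s^2 - \sqrt{a}\,s - b \le 0$, so the largest admissible $A$ is the square of the positive root of that quadratic. The sketch below (function names are illustrative) computes this extremal value and compares it with $a + 2b$.

```python
import math

def largest_admissible(a, b):
    """Largest A >= 0 with A <= b + sqrt(a*A), via the positive root in s = sqrt(A)."""
    s = (math.sqrt(a) + math.sqrt(a + 4.0 * b)) / 2.0
    return s * s

def trick_holds(a, b, tol=1e-9):
    """Check that the extremal A still satisfies A <= a + 2b."""
    return largest_admissible(a, b) <= a + 2.0 * b + tol

grid = [0.01, 0.5, 1.0, 7.0, 100.0]
print(all(trick_holds(a, b) for a in grid for b in grid))
```

Since every admissible $A$ is at most the extremal one, the inequality $A \le a + 2b$ follows for all of them; the algebra behind it is $\sqrt{a^2 + 4ab} \le a + 2b$.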


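8.4.5 Replacing $V_1$ by $\hat V_1$

The exact same strategy as for $V_2$ is used. We write

$$V_{H,\lambda,1}(t) = \hat V_{H,1}(t) - U_{Q^\top,Q}(t),$$

where this time $Q = \|H(t)\|_{2,\infty}\,I$. We have

$$b_{Q^\top,Q}(t) = b_H(t)^2 = \sup_{s \in [0,t]} \|H(s)\|_{2,\infty}^2$$

and

$$V_{Q^\top,Q,\lambda}(t) = \begin{bmatrix} V_{H,\lambda,1}(t) & 0 \\ 0 & V_{H,\lambda,1}(t) \end{bmatrix}.$$

Hence, we use again Proposition 1 with $A = B = Q$, which gives

$$\|V_{H,\lambda,1}(t) - \hat V_{H,1}(t)\|_{\mathrm{op}} \le (1+\epsilon)\sqrt{2\|V_{H,\lambda,1}(t)\|_{\mathrm{op}}\,x} + \frac{b_H(t)^2}{3}x$$

on the event $\{v < \|V_{H,\lambda,1}(t)\|_{\mathrm{op}} \le (1+\epsilon)v\} \cap \{b^2 < b_H(t)^2 \le (1+\epsilon)b^2\}$. Now, using the same trick as before, we obtain

$$V_1 \le 2\hat V_1 + 2\big((1+\epsilon)^2 + b_H(t)^2/3\big)x \tag{43}$$

and

$$\hat V_1 \le 2V_1 + \big((1+\epsilon)^2/2 + b_H(t)^2/3\big)x \tag{44}$$

on this event, with a probability larger than $1 - 2d\,e^{-x}$.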


8.4.6 Concluding the proof 

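Plugging Equations (41) and (43) with Equations (33)-(40), we can replace $V_1$ and $V_2$ by $\hat V_1$ and $\hat V_2$ respectively. Now, we use a peeling argument to remove the events that lower and upper bound $V_1$, $V_2$ and $B$. Fix $\epsilon, c_0, x, b_0 > 0$ and introduce

$$v_{1,j} = v_{2,j} = c_0 x (1+\epsilon)^j \quad \text{and} \quad b_j = b_0 (1+\epsilon)^j$$

for $j \ge 0$, and take

$$v_{1,-1} = v_{2,-1} = b_{-1} = 0.$$

Put again for short $Z = \|Z_t\|_{\mathrm{op}}$, $V_1 = \|V_{H,\lambda,1}(t)\|_{\mathrm{op}}$, $V_2 = \|V_{H,\lambda,2}(t)\|_{\mathrm{op}}$ and $B = b_H(t)$, and introduce the events

$$\mathcal{V}_1 = \{V_1 > v_{1,0}\}, \quad \mathcal{V}_2 = \{V_2 > v_{2,0}\} \quad \text{and} \quad \mathcal{B} = \{B > b_0\}.$$

We partition the whole probability space in the following way:

$$(\mathcal{V}_1^c \cup \mathcal{V}_1) \cap (\mathcal{V}_2^c \cup \mathcal{V}_2) \cap (\mathcal{B}^c \cup \mathcal{B}) = \bigcup_{j,k,l \ge -1} \mathcal{V}_{1,j} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_l,$$

where

$$\mathcal{V}_{1,j} = \{v_{1,j} < V_1 \le v_{1,j+1}\}, \quad \mathcal{V}_{2,k} = \{v_{2,k} < V_2 \le v_{2,k+1}\}, \quad \mathcal{B}_l = \{b_l < B \le b_{l+1}\}.$$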

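On each event $\mathcal{V}_{1,j} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_l$, we have a deviation on $Z$. Using (33) gives

$$\mathbb{P}\Big[Z \ge \Big(\sqrt{2c_0} + \frac{b_0}{3}\Big)x,\ \mathcal{V}_{1,-1} \cap \mathcal{V}_{2,-1} \cap \mathcal{B}_{-1}\Big] \le 2d\,e^{-x}. \tag{45}$$

Using (34) with (43), together with the fact that on this event, $V_2 \le v_{2,0} = v_{1,0} \le v_{1,j} \le V_1$, gives

$$\mathbb{P}\Big[Z \ge \sqrt{2c_{1,\epsilon}\hat V_1 x} + (c_{2,\epsilon} + c_{3,\epsilon} b_0)x,\ \mathcal{V}_{1,j} \cap \mathcal{V}_{2,-1} \cap \mathcal{B}_{-1}\Big] \le 4d\,e^{-x} \tag{46}$$

for any $j \ge 0$, where we introduced the constants

$$c_{1,\epsilon} = 2(1+\epsilon), \quad c_{2,\epsilon} = 2(1+\epsilon)^{3/2}, \quad c_{3,\epsilon} = 2\sqrt{\frac{1+\epsilon}{3}} + \frac{1}{3}.$$

Using (35) with (41), together with the fact that on this event, $V_1 \le v_{1,0} = v_{2,0} \le v_{2,k} \le V_2$, gives

$$\mathbb{P}\Big[Z \ge \sqrt{2c_{1,\epsilon}\hat V_2 x} + (c_{2,\epsilon} + c_{3,\epsilon} b_0)x,\ \mathcal{V}_{1,-1} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_{-1}\Big] \le 4d\,e^{-x} \tag{47}$$

for any $k \ge 0$. Using (36) with (41) and (43) gives

$$\mathbb{P}\Big[Z \ge \sqrt{2c_{1,\epsilon}(\hat V_1 \vee \hat V_2)x} + (c_{2,\epsilon} + c_{3,\epsilon} b_0)x,\ \mathcal{V}_{1,j} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_{-1}\Big] \le 4d\,e^{-x} \tag{48}$$

for any $j, k \ge 0$. Using (37) gives

$$\mathbb{P}\Big[Z \ge \Big(\sqrt{2(1+\epsilon)c_0} + \frac{B}{3}\Big)x,\ \mathcal{V}_{1,-1} \cap \mathcal{V}_{2,-1} \cap \mathcal{B}_l\Big] \le 2d\,e^{-x} \tag{49}$$

for any $l \ge 0$. Using (38) with (43), together with the fact that on this event, $V_2 \le v_{2,0} = v_{1,0} \le v_{1,j} \le V_1$, gives

$$\mathbb{P}\Big[Z \ge c_{1,\epsilon}\sqrt{\hat V_1 x} + (c_{4,\epsilon} + c_{5,\epsilon} B)x,\ \mathcal{V}_{1,j} \cap \mathcal{V}_{2,-1} \cap \mathcal{B}_l\Big] \le 4d\,e^{-x} \tag{50}$$

for any $j, l \ge 0$, where we introduced the constants

$$c_{4,\epsilon} = 2(1+\epsilon)^2, \quad c_{5,\epsilon} = \frac{2(1+\epsilon)}{\sqrt{3}} + \frac{1}{3}.$$

Using (39) with (41), together with the fact that on this event, $V_1 \le v_{1,0} = v_{2,0} \le v_{2,k} \le V_2$, gives

$$\mathbb{P}\Big[Z \ge c_{1,\epsilon}\sqrt{\hat V_2 x} + (c_{4,\epsilon} + c_{5,\epsilon} B)x,\ \mathcal{V}_{1,-1} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_l\Big] \le 4d\,e^{-x} \tag{51}$$

for any $k, l \ge 0$. Using (40) with (41) and (43) gives

$$\mathbb{P}\Big[Z \ge c_{1,\epsilon}\sqrt{(\hat V_1 \vee \hat V_2)x} + (c_{4,\epsilon} + c_{5,\epsilon} B)x,\ \mathcal{V}_{1,j} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_l\Big] \le 4d\,e^{-x} \tag{52}$$

for any $j, k, l \ge 0$. Taking the largest term upper bounding $Z$ in these inequalities, we obtain that

$$\mathbb{P}\Big[Z \ge c_{1,\epsilon}\sqrt{(\hat V_1 \vee \hat V_2)x} + (c_{6,\epsilon} + c_{5,\epsilon} B)x,\ \mathcal{V}_{1,j} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_l\Big] \le 4d\,e^{-x} \tag{53}$$

for any $j, k, l \ge -1$, where we introduced

$$c_{6,\epsilon} = \sqrt{2(1+\epsilon)c_0} + b_0/3 + 2(1+\epsilon)^2.$$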


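So, $Z$ is controlled by an observable term in all cases. It remains to remove all the events $\mathcal{V}_{1,j} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_l$ for $j, k, l \ge -1$. This is done by increasing $x$ by a very small observable term, and by using a union bound on all the possible combinations $j, k, l \ge -1$. Introduce for some $c_\ell > 0$

$$\hat\ell = c_\ell \Big( \log\log\Big(\frac{2\hat V_1 + 2((1+\epsilon)^2 + B^2/3)x}{c_0 x} \vee e\Big) + \log\log\Big(\frac{2\hat V_2 + 2((1+\epsilon)^2 + B^2/3)x}{c_0 x} \vee e\Big) + \log\log\Big(\frac{B}{b_0} \vee e\Big) \Big).$$

Note that $\hat\ell \ge 0$. Now, write

$$\mathbb{P}\Big[Z \ge c_{1,\epsilon}\sqrt{(\hat V_1 \vee \hat V_2)(x + \hat\ell)} + (c_{6,\epsilon} + c_{5,\epsilon} B)(x + \hat\ell)\Big] = \sum_{j,k,l \ge -1} \mathbb{P}_{j,k,l},$$

where

$$\mathbb{P}_{j,k,l} = \mathbb{P}\Big[Z \ge c_{1,\epsilon}\sqrt{(\hat V_1 \vee \hat V_2)(x + \hat\ell)} + (c_{6,\epsilon} + c_{5,\epsilon} B)(x + \hat\ell),\ \mathcal{V}_{1,j} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_l\Big],$$

and decompose in the following way

$$\begin{aligned}
\sum_{j,k,l \ge -1} \mathbb{P}_{j,k,l} = {} & \mathbb{P}_{-1,-1,-1} + \sum_{j \ge 0} \mathbb{P}_{j,-1,-1} + \sum_{k \ge 0} \mathbb{P}_{-1,k,-1} + \sum_{l \ge 0} \mathbb{P}_{-1,-1,l} \\
& + \sum_{j,k \ge 0} \mathbb{P}_{j,k,-1} + \sum_{j,l \ge 0} \mathbb{P}_{j,-1,l} + \sum_{k,l \ge 0} \mathbb{P}_{-1,k,l} + \sum_{j,k,l \ge 0} \mathbb{P}_{j,k,l}.
\end{aligned}$$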


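For $\mathbb{P}_{-1,-1,-1}$, we use the fact that $c_{1,\epsilon}\sqrt{(\hat V_1 \vee \hat V_2)(x + \hat\ell)} + (c_{6,\epsilon} + c_{5,\epsilon} B)(x + \hat\ell) \ge c_{6,\epsilon} x \ge (\sqrt{2c_0} + b_0/3)x$, and then (45) to obtain that

$$\mathbb{P}_{-1,-1,-1} \le 2d\,e^{-x}.$$

For $\mathbb{P}_{j,-1,-1}$, we use (43) to get that on $\mathcal{V}_{1,j} \cap \mathcal{V}_{2,-1} \cap \mathcal{B}_{-1}$

$$c_{1,\epsilon}\sqrt{(\hat V_1 \vee \hat V_2)(x + \hat\ell)} + (c_{6,\epsilon} + c_{5,\epsilon} B)(x + \hat\ell) \ge \sqrt{2c_{1,\epsilon}\hat V_1 (x + \ell_j^{(1)})} + (c_{2,\epsilon} + c_{3,\epsilon} b_0)(x + \ell_j^{(1)}),$$

where we put

$$\ell_j^{(1)} = c_\ell \log\log\Big(\frac{v_{1,j}}{v_{1,0}} \vee e\Big) = \log\big((j\log(1+\epsilon) \vee 1)^{c_\ell}\big). \tag{54}$$

So, using (46), we obtain

$$\sum_{j \ge 0} \mathbb{P}_{j,-1,-1} \le 4d\,e^{-x} \sum_{j \ge 0} e^{-\ell_j^{(1)}} = 4d\,e^{-x}\Big(1 + \log(1+\epsilon)^{c_\ell} \sum_{j \ge 1} j^{-c_\ell}\Big).$$

We obtain in the exact same way using (41) and (47) that

$$\sum_{k \ge 0} \mathbb{P}_{-1,k,-1} \le 4d\,e^{-x}\Big(1 + \log(1+\epsilon)^{c_\ell} \sum_{k \ge 1} k^{-c_\ell}\Big),$$

and also

$$\sum_{l \ge 0} \mathbb{P}_{-1,-1,l} \le 4d\,e^{-x}\Big(1 + \log(1+\epsilon)^{c_\ell} \sum_{l \ge 1} l^{-c_\ell}\Big)$$

using (49). For $\mathbb{P}_{j,k,-1}$, we use (48) with (41) and (43), together with the fact that on $\mathcal{V}_{1,j} \cap \mathcal{V}_{2,k} \cap \mathcal{B}_{-1}$ we have

$$c_{1,\epsilon}\sqrt{(\hat V_1 \vee \hat V_2)(x + \hat\ell)} + (c_{6,\epsilon} + c_{5,\epsilon} B)(x + \hat\ell) \ge \sqrt{2c_{1,\epsilon}(\hat V_1 \vee \hat V_2)(x + \ell_j^{(1)} + \ell_k^{(2)})} + (c_{2,\epsilon} + c_{3,\epsilon} b_0)(x + \ell_j^{(1)} + \ell_k^{(2)}).$$

This gives

$$\sum_{j,k \ge 0} \mathbb{P}_{j,k,-1} \le 4d\,e^{-x} \sum_{j \ge 0} e^{-\ell_j^{(1)}} \sum_{k \ge 0} e^{-\ell_k^{(2)}} = 4d\,e^{-x}\Big(1 + \log(1+\epsilon)^{c_\ell} \sum_{j \ge 1} j^{-c_\ell}\Big)^2.$$

We obtain in the same way, using (50), (51), (52) and (41), (43), that

$$\sum_{k,l \ge 0} \mathbb{P}_{-1,k,l} \le 4d\,e^{-x}\Big(1 + \log(1+\epsilon)^{c_\ell} \sum_{j \ge 1} j^{-c_\ell}\Big)^2,$$

$$\sum_{j,l \ge 0} \mathbb{P}_{j,-1,l} \le 4d\,e^{-x}\Big(1 + \log(1+\epsilon)^{c_\ell} \sum_{j \ge 1} j^{-c_\ell}\Big)^2,$$

$$\sum_{j,k,l \ge 0} \mathbb{P}_{j,k,l} \le 4d\,e^{-x}\Big(1 + \log(1+\epsilon)^{c_\ell} \sum_{j \ge 1} j^{-c_\ell}\Big)^3.$$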

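Put $c_{\ell,\epsilon} = 1 + \log(1+\epsilon)^{c_\ell} \sum_{j \ge 1} j^{-c_\ell}$. We finally have that

$$\mathbb{P}\Big[Z \ge c_{1,\epsilon}\sqrt{(\hat V_1 \vee \hat V_2)(x + \hat\ell)} + (c_{6,\epsilon} + c_{5,\epsilon} B)(x + \hat\ell)\Big] \le 2\big(1 + 6(c_{\ell,\epsilon} + c_{\ell,\epsilon}^2) + 2c_{\ell,\epsilon}^3\big)\,d\,e^{-x}.$$

Now, choose $\epsilon = c_0 = b_0 = 1$ and $c_\ell = 2$. For this choice we have $2(1 + 6(c_{\ell,\epsilon} + c_{\ell,\epsilon}^2) + 2c_{\ell,\epsilon}^3) \le 84.9$, $c_{1,\epsilon} = 4$, $c_{5,\epsilon} = 4/\sqrt{3} + 1/3 \le 2.65$ and $c_{6,\epsilon} = 2 + 1/3 + 8 \le 10.34$. This concludes the proof of Theorem 3.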
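The numerical constants quoted at the end of the proof can be recomputed from the choice $\epsilon = c_0 = b_0 = 1$. The sketch below assumes the expressions $c_{1,\epsilon} = 2(1+\epsilon)$, $c_{5,\epsilon} = 2(1+\epsilon)/\sqrt{3} + 1/3$ and $c_{6,\epsilon} = \sqrt{2(1+\epsilon)c_0} + b_0/3 + 2(1+\epsilon)^2$ as read from the proof; it also checks that the three failure constants 30.55, 30.55 and 84.9 used in the proof of Theorem 1 add up to 146. The variable names are illustrative.

```python
import math

eps, c0, b0 = 1.0, 1.0, 1.0  # the choice made at the end of the proof

c1 = 2 * (1 + eps)                                   # leading constant in front of the variance term
c5 = 2 * (1 + eps) / math.sqrt(3) + 1 / 3            # constant multiplying B in the linear term
c6 = math.sqrt(2 * (1 + eps) * c0) + b0 / 3 + 2 * (1 + eps) ** 2

# Failure constants of the three concentration inequalities combined in Section 8.2.
total_failure_constant = 30.55 + 30.55 + 84.9

print(c1, round(c5, 4), round(c6, 4), round(total_failure_constant, 2))
```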